---
jupyter:
  jupytext:
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.14.6
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
    name: python3
---

# Querying a NifVector graph - Dutch

## Introduction

```python
import logging
import sys

# Log progress messages with a timestamp to stdout
logging.basicConfig(stream=sys.stdout,
                    format='%(asctime)s %(message)s',
                    level=logging.INFO)
```

## Querying the NifVector graph based on DBpedia


These results come from a NifVector graph created from 100,000 DBpedia pages. We define the context of a phrase in its simplest form: the tuple of the preceding multiword and the following multiword (no preprocessing and no changes to the text, i.e. stopwords and punctuation are kept). The maximum phrase length is five words, and the maximum left and right context lengths are also five words.
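
As a rough illustration of this definition, the sketch below collects the (left, right) contexts of every phrase of a given length in one tokenized sentence. The function `toy_phrase_contexts` and its parameters are hypothetical and written for this example only. It is not how nifigator builds the graph, and the context limit is set to three words instead of five to keep the output small.

```python
from collections import Counter


def toy_phrase_contexts(tokens, phrase_len=1, max_context=2):
    """Count (phrase, (left, right)) pairs in one sentence (illustrative only)."""
    # Mark the sentence boundaries so that they can be part of a context
    tokens = ["SENTSTART"] + list(tokens) + ["SENTEND"]
    counts = Counter()
    # A phrase never includes the boundary tokens themselves
    for start in range(1, len(tokens) - phrase_len):
        end = start + phrase_len
        phrase = " ".join(tokens[start:end])
        # Every left context of 1..max_context words before the phrase
        for left_len in range(1, min(max_context, start) + 1):
            left = " ".join(tokens[start - left_len:start])
            # Every right context of 1..max_context words after the phrase
            for right_len in range(1, min(max_context, len(tokens) - end) + 1):
                right = " ".join(tokens[end:end + right_len])
                counts[(phrase, (left, right))] += 1
    return counts


sentence = "Het beeld is gemaakt door Rodin"
counts = toy_phrase_contexts(sentence.split(), phrase_len=2, max_context=3)
# All contexts found for the two-word phrase "is gemaakt"
is_gemaakt = sorted(ctx for (phrase, ctx) in counts if phrase == "is gemaakt")
```

For this sentence, `is_gemaakt` contains nine contexts, among them `('beeld', 'door')` and `('SENTSTART Het beeld', 'door')`, which have the same shape as the contexts in the results below.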

```python
from rdflib import URIRef

database_url = 'http://localhost:3030/dbpedia_nl'
identifier = URIRef("https://mangosaurus.eu/dbpedia")
```

```python
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore
from nifigator import NifVectorGraph

# Connect to triplestore
store = SPARQLUpdateStore(
    query_endpoint = database_url+'/sparql',
    update_endpoint = database_url+'/update'
)
# Create NifVectorGraph with this store
g = NifVectorGraph(
    store=store, 
    identifier=identifier
)
```


### Most frequent contexts of a phrase


The ten most frequent contexts in which the phrase 'is gemaakt' ('is made') occurs, with their number of occurrences, are the following:

```python
# most frequent contexts of the phrase "is gemaakt"
g.phrase_contexts("is gemaakt", topn=10)
```

This results in:

```console
Counter({('gebruik', 'van'): 15,
         ('Het', 'door'): 12,
         ('SENTSTART Het', 'door'): 11,
         ('beeld', 'door'): 7,
         ('tekst', 'door'): 6,
         ('De tekst', 'door'): 5,
         ('SENTSTART De tekst', 'door'): 5,
         ('ei', 'van'): 5,
         ('muziek', 'door'): 5,
         ('en', 'door'): 3})
```

SENTSTART and SENTEND are tokens to indicate the start and end of a sentence.


### Phrase and context frequencies


The contexts in which a phrase occurs represent, to some extent, its properties and meaning. If you derive the phrases that share the most frequent contexts of the phrase 'is gemaakt', you get the following table (the columns contain the contexts, the rows the phrases that have the most contexts in common):

```python
import pandas as pd

# phrases (rows) that share the eight most frequent contexts (columns) of "is gemaakt"
pd.DataFrame.from_dict(
    g.dict_phrases_contexts("is gemaakt", topcontexts=8), orient='tight'
)
```

This results in:

```console
                 gebruik  Het   SENTSTART Het  beeld  tekst  De tekst  SENTSTART De tekst  ei
                 van      door  door           door   door   door      door                van
is gemaakt       15       12    11             7      6      5         5                   5
is geschreven    0        79    78             0      22     12        11                  0
werd geschreven  0        19    19             0      14     10        10                  0
is               0        47    47             2      2      0         0                   0
werd gemaakt     53       2     2              2      0      0         0                   0
was              2        10    9              0      0      0         0                   0
werd             0        83    82             2      0      0         0                   0
```
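
The `orient='tight'` option of pandas expects a dictionary with the keys `index`, `columns`, `data`, `index_names` and `column_names`. The sketch below builds such a dictionary by hand from made-up counts, to show the shape a phrase-by-context table could have. Whether `dict_phrases_contexts` returns exactly this layout is an assumption, and `toy_tight_table` is a hypothetical helper.

```python
from collections import Counter

# Made-up context counts per phrase (not taken from the DBpedia graph)
toy_counts = {
    "is gemaakt": Counter({("Het", "door"): 3, ("gebruik", "van"): 2}),
    "is geschreven": Counter({("Het", "door"): 5}),
}


def toy_tight_table(counts, phrase, topcontexts=2):
    """Build a 'tight' dictionary: rows are phrases, columns are contexts."""
    # Columns: the most frequent contexts of the query phrase
    columns = [context for context, _ in counts[phrase].most_common(topcontexts)]
    index = list(counts)
    # A Counter returns 0 for a context in which a phrase never occurs
    data = [[counts[p][context] for context in columns] for p in index]
    return {
        "index": index,
        "columns": columns,
        "data": data,
        "index_names": [None],
        "column_names": [None, None],  # two levels: left and right context
    }


tight = toy_tight_table(toy_counts, "is gemaakt")
# With pandas installed: pd.DataFrame.from_dict(tight, orient="tight")
```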


### Phrase similarities


Based on the approach above we can derive top phrase similarities.

```python
# top phrase similarities of the phrase "is gemaakt"
g.most_similar("is gemaakt", topn=10, topcontexts=15)
```

This results in:

```console
{'is gemaakt': (15, 15),
 'is geschreven': (9, 15),
 'werd': (9, 15),
 'is': (8, 15),
 'werd geschreven': (7, 15),
 'wordt': (7, 15),
 'was': (6, 15),
 'werd gemaakt': (6, 15),
 'wordt gemaakt': (5, 15),
 'gemaakt': (4, 15)}
```
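
One way to read these pairs: the second number is the number of top contexts of the query phrase that were considered (`topcontexts`), and the first is how many of those contexts the candidate phrase also occurs in. A minimal sketch of that counting follows, using made-up data and a hypothetical `toy_most_similar` function. The library's own implementation may differ.

```python
from collections import Counter

# Made-up context counts per phrase (not taken from the DBpedia graph)
contexts_by_phrase = {
    "groter": Counter({("veel", "dan"): 9, ("is", "dan"): 7, ("steeds", "SENTEND"): 4}),
    "kleiner": Counter({("veel", "dan"): 5, ("is", "dan"): 6, ("iets", "SENTEND"): 2}),
    "koning": Counter({("de", "van"): 8, ("is", "dan"): 1}),
}


def toy_most_similar(phrase, contexts_by_phrase, topcontexts=2, topn=10):
    """Rank phrases by how many of the query's top contexts they share."""
    top = [c for c, _ in contexts_by_phrase[phrase].most_common(topcontexts)]
    scores = {
        other: (sum(1 for c in top if c in contexts), len(top))
        for other, contexts in contexts_by_phrase.items()
    }
    # Highest number of shared contexts first, ties in alphabetical order
    ranked = sorted(scores.items(), key=lambda item: (-item[1][0], item[0]))
    return dict(ranked[:topn])


similar = toy_most_similar("groter", contexts_by_phrase, topcontexts=2)
```

Here `similar` ranks 'kleiner' directly after 'groter' because it occurs in both top contexts, while 'koning' shares only one of them.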

Now take a look at words similar to 'groter' ('larger').

```python
# top phrase similarities of the word "groter" ("larger")
g.most_similar("groter", topn=10, topcontexts=15)
```

Resulting in:

```console
{'groter': (15, 15),
 'kleiner': (14, 15),
 'breder': (11, 15),
 'hoger': (11, 15),
 'lager': (11, 15),
 'beter': (10, 15),
 'langer': (10, 15),
 'meer': (10, 15),
 'sneller': (10, 15),
 'minder': (9, 15)}
```

```python
# top phrase similarities of the word "koning" ("king")
g.most_similar("koning", topn=10, topcontexts=25)
```

This results in:

```console
{'koning': (25, 25),
 'hertog': (22, 25),
 'keizer': (22, 25),
 'vorst': (21, 25),
 'prins': (20, 25),
 'graaf': (19, 25),
 'koningin': (18, 25),
 'Koning': (17, 25),
 'bisschop': (17, 25),
 'groothertog': (17, 25)}
```


Instead of single words, we can also find similarities for multiword phrases:

```python
# top phrase similarities of Willem Alexander (King of the Netherlands)
g.most_similar("Willem Alexander", topn=10, topcontexts=15)
```

```console
{'Willem Alexander': (15, 15),
 'Filip': (8, 15),
 'Boudewijn': (7, 15),
 'Willem I': (7, 15),
 'Willem III': (7, 15),
 'Albert II': (6, 15),
 'George III': (6, 15),
 'Maximiliaan I Jozef van Beieren': (6, 15),
 'Willem II': (6, 15),
 'Christiaan IX van Denemarken': (5, 15)}
```


### Most frequent phrases of a context


Here are some examples of the most frequent phrases of a context.

```python
context = ("koning", "van Engeland")
for r in g.context_phrases(context, topn=10).items():
    print(r)
```

```console
('Eduard III', 21)
('Karel II', 21)
('Karel I', 17)
('Eduard I', 15)
('Hendrik VIII', 13)
('Jacobus II', 13)
('Eduard IV', 11)
('Hendrik VII', 9)
('Jacobus I', 9)
('Hendrik II', 8)
```

```python
context = ("de", "stad")
for r in g.context_phrases(context, topn=10).items():
    print(r)
```

```console
('grootste', 493)
('oude', 228)
('gelijknamige', 178)
('tweede', 120)
('Duitse', 116)
('Nederlandse', 116)
('belangrijkste', 105)
('huidige', 99)
('hele', 97)
('nieuwe', 91)
```
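
Looking up the phrases of a context is the reverse of looking up the contexts of a phrase. The sketch below shows the idea on a few made-up observations. `toy_context_phrases` is a hypothetical helper, not part of nifigator.

```python
from collections import Counter

# Made-up (phrase, context) observations, one entry per occurrence in a text
observations = [
    ("Karel II", ("koning", "van Engeland")),
    ("Eduard III", ("koning", "van Engeland")),
    ("Karel II", ("koning", "van Engeland")),
    ("grootste", ("de", "stad")),
    ("Karel II", ("de", "stad")),
]


def toy_context_phrases(observations, context, topn=10):
    """Count the phrases that occur in the given (left, right) context."""
    counts = Counter(phrase for phrase, ctx in observations if ctx == context)
    return dict(counts.most_common(topn))


kings = toy_context_phrases(observations, ("koning", "van Engeland"))
```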


### Phrase similarities given a specific context

Some phrases have multiple meanings. Take a look at the contexts of the word 'middel':

```python
g.phrase_contexts("middel", topn=10)
```

This results in:

```console
Counter({('door', 'van'): 4956,
         ('door', 'van een'): 1516,
         ('door', 'van de'): 406,
         ('SENTSTART Door', 'van'): 398,
         ('Door', 'van'): 390,
         ('door', 'van het'): 264,
         ('worden door', 'van'): 142,
         ('een', 'om'): 135,
         ('die door', 'van'): 117,
         ('Door', 'van een'): 102})
```

It is possible to take a specific context into account when using the `most_similar` function, as follows:

```python
g.most_similar(phrase="middel", context=("door", "van"), topcontexts=50, topphrases=15, topn=10)
```

The result is:

```console
{'middel': (50, 50),
 'toedoen': (21, 50),
 'gebruik te maken': (19, 50),
 'toepassing': (15, 50),
 'gebruik': (11, 50),
 'leden': (10, 50),
 'toevoeging': (7, 50),
 'die': (6, 50),
 'doelpunten': (4, 50),
 'samenvoeging': (3, 50)}
```

```python
g.most_similar(phrase="middel", context=("een", "dat"), topcontexts=50, topphrases=15, topn=10)
```

In this case the result is:

```console
{'systeem': (5, 5),
 'apparaat': (4, 5),
 'programma': (4, 5),
 'proces': (3, 5),
 'teken': (3, 5),
 'bedrijf': (2, 5),
 'gebied': (2, 5),
 'tijd': (2, 5),
 'woord': (2, 5),
 'boek': (1, 5)}
```


### Phrase similarities given a set of contexts

It is also possible to find the phrases that fit a set of contexts. Here the set consists of the 15 most common contexts shared by 'dacht' ('thought') and 'vond' ('found').

```python
# the 15 most common contexts shared by "dacht" and "vond"
c1 = [
    c[0] for c in (
        g.phrase_contexts("dacht", topn=None) &
        g.phrase_contexts("vond", topn=None)
    ).most_common(15)
]
c1
```

This results in:

```console
[('hij', 'dat'),
 ('Hij', 'dat'),
 ('SENTSTART Hij', 'dat'),
 ('omdat hij', 'dat'),
 ('en', 'dat'),
 ('men', 'dat'),
 ('hij', 'dat de'),
 ('die', 'dat'),
 ('hij', 'dat hij'),
 ('dat hij', 'dat'),
 ('omdat men', 'dat'),
 ('hij', 'dat het'),
 ('omdat hij', 'dat de'),
 ('Hij', 'dat het'),
 ('SENTSTART Hij', 'dat het')]
```
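
The `&` operator on two `Counter` objects keeps only the keys present in both and takes the minimum of the two counts, which is why `c1` contains only contexts shared by both verbs. A small sketch with made-up counts follows. It assumes, as the output shown earlier suggests, that `phrase_contexts` returns a `Counter`.

```python
from collections import Counter

# Made-up context counts for two verbs (not taken from the DBpedia graph)
dacht = Counter({("hij", "dat"): 40, ("men", "dat"): 5, ("ik", "aan"): 12})
vond = Counter({("hij", "dat"): 25, ("men", "dat"): 9, ("hij", "het"): 30})

# Intersection: shared contexts only, each with the smaller of the two counts
shared = dacht & vond

# The shared contexts, most frequent first
top_shared = [context for context, _ in shared.most_common(2)]
```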

```python
g.most_similar(contexts=c1, topn=10)
```

Resulting in:

```console
{'dacht': (15, 15),
 'meende': (15, 15),
 'vond': (15, 15),
 'stelde': (12, 15),
 'zei': (12, 15),
 'beweerde': (11, 15),
 'denkt': (11, 15),
 'geloofde': (11, 15),
 'vindt': (11, 15),
 'wist': (11, 15)}
```
